Sequence matching algorithms and pairing of noncoding RNAs 



S.K. Nechaev, 1 ' 2 ' 3 M.V. Tamm, 4 and O.V. Valba 5 

1 LPTMS, Universite Paris Sud, 91405 Orsay Cedex, France 

2 P.N. Lebedev Physical Institute of the Russian Academy of Sciences, 119991, Moscow, Russia 

3 J.-V. Poncelet Labotatory, Independent University, 119002, Moscow, Russia 

^Physics Department, Moscow State University, 119992, Moscow, Russia 

5 Moscow Institute of Physics and Technology, 141700, Dolgoprudny, Russia 

(Dated: November 12, 2010) 

A new statistical method of alignment of two heteropolymers which can form hier- 
O ■ archical cloverleaf-like secondary structures is proposed. This offers a new construc- 

tive algorithm for quantitative determination of binding free energy of two noncoding 
RNAs with arbitrary primary sequences. The alignment of ncRNAs differs from the 
complete alignment of two RNA sequences: in ncRNA case we align only the se- 
quences of nucleotides which constitute pairs between two different RNAs, while the 
secondary structure of each RNA comes into play only by the combinatorial factors 
affecting the entropc contribution of each molecule to the total cost function. The 
proposed algorithm is based on two observations: i) the standard alignment problem 
is considered as a zero-temperature limit of a more general statistical problem of 
binding of two associating heteropolymer chains; ii) this last problem is generalized 
onto the sequences with hierarchical cloverleaf-like structures (i.e. of RNA-type). 
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. Taking zero-temperature limit at the very end we arrive at the desired "cost func- 

£N) ■ tion" of the system with account for entropy of side cactus-like loops. Moreover, 

we have demonstrated in detail how our algorithm enables to solve the "structure 
recovery" problem. Namely, we can predict in zero-temperature limit the cloverleaf- 
like (i.e. secondary) structure of interacting ncRNAs by knowing only their primary 
sequences. 
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I. INTRODUCTION 



A. Binding of noncoding RNAs 

According to a common definition, the noncoding RNA (ncRNA) is an RNA molecule that 
is not translated into a protein. The ncRNAs either regulate the gene expression directly, 
for example by occupying the ribosome binding site, or indirectly providing RNA targeting 
specificity for a protein-based regulatory mechanism The class of ncRNAs spreads 
on regulatory and functional RNAs. In modern classification the term "noncoding RNA" is 
basically attributed to eucariotic RNAs, sometimes called also "small nonmessenger RNAs" . 



In general, regulatory RNAs act in the cell by one of two basic mec 



ranisms: by base-pairing 



interactions with other nucleic acids, or by binding to proteins [2|. The base pairing with 
target molecules constitutes the typical mechanism, by which the ncRNA regulates the gene 
expression. The base pairing is subdivided in two classes depending on their locations: 
ds-encoded ncRNAs are placed at the same genetic location but on the strand opposite to 
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the target RNA, and trans-encoded ncRNAs are placed at a chromosomial location distinct 
from the target RNA. 

It should be noted however that the direct gene regulation of ncRNA via specific base 
pairing is not completely understood. One of few known examples concerns the partici- 
pation of the Xist gene in X-inactivation jij]. X-inactivation (also called lyonization) is a 
process by which one of two copies of X chromosome present in female mammals is inacti- 
vated. The inactive X chromosome is silenced by packaging into transcriptionally inactive 
heterochromatin. The Xist gene exhibits properties of the X-inactivation center and Xist 
ncRNA becomes localized close to the autosome into which the gene is integrated 

Since base-pairing of noncoding and target RNAs plays such important biological role, it 
is worth estimating theoretically the binding free energy of the ncRNA-target RNA complex 
by knowing the primary structures of each macromolecule. This problem resembles the 
alignment of two RNA sequences with one principal difference: in ncRNA case we align only 
the sequences of nucleotides which constitute pairs between two RNAs, while the secondary 
structure of each RNA comes into play only by the combinatorial factors affecting the entropc 
contribution to the total cost function. 

One of the key problems in computational ncRNA genefinding is to predict RNA tran- 
script initiation, termination, and processing. However, accurate prediction of even simple 
transcription units remains an open question - see, for example, the minireview 

In brief, the main goal of this work consists in developing a constructive method to build 
a "cost function", which characterizes matching (alignment) of two noncoding RNAs with 
arbitrary primary sequences. 



B. Noncoding RNAs as particular class of associating heteropolymers 

To put problem of alignment of ncRNAs into the context of statistical mechanics, it seems 
desirable to extract the basic features of ncRNAs which would play the major role in our 
analysis. The ncRNAs are the particular examples of a wide class of so-called "associating" 
heteropolymers. 

Generally, associating polymers, besides the strong covalent interactions responsible 
for the frozen primary sequence of monomer units, are capable of forming additional 
weaker reversible temperature-dependent (i.e. "thermoreversible" ) bonds between different 
monomers. Many biologically important macromolecules, like proteins and nucleic acids, 
belong to the class of associating polymers jsj. 

For associating polymers the variety of possible thermodynamic states and ternary struc- 
tures is determined by the interplay between the following three major factors: i) the energy 
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gain due to the direct "pairing", i.e. formation of thermoreversible contacts; ii) the com- 
binatoric entropy due to the choice of which particular monomers (among those able to 
participate in bonds formation) do actually create bonds; iii) the loss of conformational 
entropy of the polymer chain due to pairing (and in particular, the entropic penalty of loop 
creation between two paired monomers). 

Among a variety of macromolecular systems with thermoreversible pairing we pay a spe- 
cial attention to a class of RNA-like polymers. These polymers are distinguished from other 
biologically active associating polymers, such as, for instance, proteins, by a capability of 
forming hierarchical "cloverleaf-like" (or "cactus-like") secondary structures. The forma- 
tion of a thermoreversible contact between two distant bonds in a RNA molecule (or in 
a single-stranded DNA) imposes a nonlocal constraint on a number of unpaired possible 
conformations: all bonds in a RNA chain are known to be arranged in a way to allow only 
hierarchical cactus-like folded conformations topologically isomorphic to a tree. The pairs 
of bonds, which do not obey such a structure are called "pseudoknots" . In most cases they 
are forbidden for RNA molecules. We shall accept the absence of pseudoknots as a matter 
of fact. Let us note however that in the work [6| the dynamic programming algorithm has 
been developed for predicting optimal RNA secondary structure, including pseudoknots. 

Being formulated in statistical terms, the main goal of our work can be rephrased as 
follows. We propose a new efficient and statistically justified algorithm for the determination 
of the binding free energy of any two primary heteropolymer sequences under the supposition 
that each sequence can form a hierarchical cactus-like secondary structure, typical for RNA 
molecules. 

C. Pairing vs alignment 

Let us reveal the similarities and differences between computations of the free energy of 
associating heteropolymer complexes and standard matching algorithms. 

The matching (or "alignment" ) problem, even for linear structures is one of the key tasks 
of computational evolutionary biology. In particular, one of the most important applica- 
tions of Longest Common Subsequence (LCS) search in linear structures is a quantitative 
definition of a "closeness" of two DNA sequences. Such a comparison provides information 
about how far, in evolutionary terms, two genes of one parent have deviated from each other. 
Also, when a new DNA molecule is sequenced in vitro, it is important to know whether it is 
really new or it is similar to already existing molecules. This is achieved quantitatively by 
measuring the LCS of the new molecule with other ones available from databases. 

The task of the present work consists of extending the statistical approach developed for 
alignment of linear sequences to the computation of pairing free energy of two RNA-type 
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structures. The target object of our approach is a ground state free energy as complexes nc 
RNA - target RNA, or ncRNA - DNA. 



II. THEORETICAL BACKGROUND 

A. Alignment of linear sequences 

Recall that the problem of finding the LCS of a pair of linear sequences drawn from the 
alphabet of c letters is formulated as follows. Consider two sequences a = {a±, a>2, ■ ■ ■ , «m} 
(of length m) and j3 = {(3i, (3 2 , . . . , n } (of length n). For example, let a and (3 be two random 
sequences of c = 4 base pairs A, C, G, T of a DNA molecule, e.g., a = {A, C, G, C, T, A, C} 
with m = 6 and (3 = {C, T, G, A, C} with n — 5. Any subsequence of a (or (3) is an ordered 
sublist of a (and of j3) entries which need not to be consecutive, e.g, it could be {C, G, T, C}, 
but not {T, G, C}. A common subsequence of two sequences a and (3 is a subsequence of 
both of them. For example, the subsequence {C, G, A, C} is a common subsequence of both 
a and (3. There are many possible common subsequences of a pair of initial sequences. The 
aim of the LCS problem is to find the longest of them. This problem and its variants have 



been widely studied in biology [l-10|, c omp uter science [ll 14 ] , probability theory (ls) 
and more recently in statistical physics 15|, I22M24 1 . 



21 



The basis of dynamic programming algorithms for comparing genetic sequences has been 



formulated for the first time in 25] (see also 26[). In general setting this algorithm takes 



into account the number of perfect matches and the difference between mismatches and 
gaps. Being formulated in statistical terms, it consists in constructing the "cost function", 



F, having a meaning of an energy (see, for example 27], |28| for details) 



F = A match + fi N mis + 5 A gap (1) 

In Eq.flTJ Amatch, A mis and A gap are correspondingly the numbers of matches, mismatches and 
gaps in a given pair of sequences, and \i and 5 are respectively the energies of mismatches 
and gaps. Without the loss of generality, the energy of matches can be always set to 1. 
Besides Eq.flT]) we have an obvious conservation law 

n + m = 2A match + 2A mis + A gap (2) 

which allows one to exclude A gap from Eq.([T]) and rewrite it as follows: 

F = A ma t c h + /iA mis + 5(n + m-2A matc h-2A mis ) = (1 - 2<5)A matc h + (/i-25)A mis + const (3) 

In Eq.()3]) the irrelevant constant 8(n + m) can be dropped out. 

Adopting (1 — 25) as a unit of energy, we arrive at the following expression 

F = A match + 7 A mis (4) 
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where 



7 



H-28 



(5) 



1-25 

and 7 < 1 by definition. The interesting region is < 7 < 1, otherwise there are no 
mismatches at all in the ground state (i.e., there is no difference between 7 = 0, which 
corresponds to simplest version of the LCS problem, and 7 < 0). 

It is known 



27 



that the maximal cost function 



F max = max[iV match + 7iV mis ] 
can be computed recursively using the "dynamic programming" 



F™: = max 



with 



Cm,n 



nmax nmax pmax . /■ 

r m-\,ni r m,n-\i r m— l,n— 1 ' Sm,n 



1 in case of match 
7 in case of mismatch 



(6) 



(7) 



(8) 
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In our previous studies of matching statistics in linear sequences we have shown in 
that properly normalized asymptotic distribution of the LCS in a somewhat simplified ver- 
sion of the problem, known in literature as a "Bernoulli model", is given by the so-called 
Tracy-Widom distribution, first derived for the distribution of the highest eigenvalues of 



random matrices belonging to the Gaussian ensemble 30_|, |31 |. 



B. Matching vs pairing of two random linear heteropolymers 



Consider the auxiliary statistical model describing the formation of a complex of two 
heteropolymer linear chains with arbitrary primary sequences. Let these chains be of lengths 
Li = mi and L 2 = ni correspondingly. In what follows we shall measure the lengths of the 
chains in number of monomers, m and n, supposing that the size of an elementary unit, £, 
is equal to 1. Every monomer can be chosen from a set of c different types A, B, C, D, 
... . Monomers of the first chain could form saturating reversible bonds with monomers 
of the second chain. The term "saturating" means that any monomer can form a bond 
with at most one monomer of the other chain. The bonds between similar types (like A-A, 
B B, C-C, etc.) have the attraction energy u and are called below "matches", while the 
bonds between different t ype s (like A-B, A-D, B-D, etc.) have the attraction energy v and 
are called "mismatches" 38] . Suppose also that some parts of the chains can form loops. 
These loops obviously produce "gaps" since the monomers inside the loops of one chain have 
no matching (or mismatching) counterparts in the other chain. Schematically a particular 
configuration of the system under consideration for c = 2 is shown in Figf]] 
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chain 1 
chain 2 




Figure 1: Schematic picture of a complex of two random linear heteropolymer chains with two 
types of letters (c = 2). 



Our aim is to compute the free energy of the described model at sufficiently low tempera- 
tures under the supposition that the entropic contribution of the loop formation is negligible 
compared to the energetic part of the direct interactions between chain monomers. Let G m>n 
be the partition function of such a complex; G m ^ n is the sum over all possible arrangements 
of bonds. In the low-temperature limit we can write G m ^ n recursively: 

m,n 

G m ,n = 1 + 2_i ^i-lj'-l 

(9) 

G m fl = 1; Go :n = 1; Go,0 = 1 

The meaning of the equation (Q is straightforward. Starting from, say, the left ends of the 
chains shown in Fig{T]we nn d the first actually existing contact between the monomers i (of 
the first chain) and j (of the second chain) and sum over all possible arrangements of this 
first contact. The first term "1" in (j^J) means that we have not found any contact at all. 
The entries (1 < i < m, 1 < j < n) are the statistical weights of the bonds which are 
encoded in a contact map {/?}: 

{B + = e u l T monomers i and j match 
(10) 
f3 = e v/ monomers i and j do not match 

The straightforward computation shows that the partition function G m>n (^j obeys the 
following exact local recursion 

Note that if Pij = 2 for all 1 < i < m and 1 < j < n, the recursion relation (TTTT) generates 



the so-called Delannoy numbers 33]. 



8 S.K. Nechaev, M.V. Tamm, O.V. Valba 



Let us point out that since we are working at finite temperatures, the account for "loop 
factors" is desirable. Under the "loop factor" we understand the entropic contribution to the 
free energy of the entire system coming from the fluctuations of parts of heteropolymer chains 
between successive contacts. Obviously, in the zero-temperature limit these fluctuations 
vanish. 

Write the partition function G m ^ n as G m ,n = exp{F m n /T}, where —F n ^ m and T are the 
free energy and the temperature of the complex of two heterogeneous chains of lengths m 
and n. Considering the T — > limit of the equation (Jlip . we get 

F 



lim Tin ( e F ^-^/ T + e ^,»-i/ T + (R _ i) e Fm-i, n -i/T\ , u \ 
r->o V / 



which can be regarded as an equation for the ground state energy of a chain. The expression 
([13]) reads 

Fm,n max [T m _x n , F m n —\^ F m —\ n —\ -\- T] m ,n\ (1^) 

where 

{r?+ = Tln(e"/ T -l) match 
, (14) 
7] =T\n(e v < —1) mismatch 

Indeed, the ground state energy (TT3T) may correspond either: (i) to the last two monomers 
connected, then the ground state energy equals -F™^* n _ x + (m,n, or (ii) to the unconnected 
end monomer of the fist (or second) chain, then the ground state energy is F^ a ^_ l (or F™\ n ). 

Taking r] + as the unit of the energy, rewrite (TT5]) in a form identical to the dynamic 
programming equation ([7]): 



F m „ = max 



'm-^ni F m ri —i, F m _i n _i -\- Tj m n (1^) 



with 



in case of match 

Vm,n = { if , . (16) 

= — m case of mismatch 

(compare to (JSJ). In the low-temperature limit the parameter a has simple expression in 
terms of coupling constants u and v. 

rf \n(e v / T - 1 



r] + ln(e"/ T - 

The initial conditions for F m ^ n are transformed into F 0tTl = F n = F Q = 



C. Matching vs pairing of two random RNA type heteropolymers 



Having the applications to RNA molecules in mind, assume that the structures formed 
by thermoreversible bonds of each chain are always of a cactus-like type, as shown in FigJ2h. 
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It means that we restrict ourselves to the situation in which the chain conformations with 
"pseudoknots" shown in Figj2b are prohibited. The difference between allowed and not 
allowed structures becomes more transparent, being redrawn in the following way. Represent 
a polymer under consideration as a straight line with active monomers situated along it 
in the natural order, and depict the bonds by dashed arcs connecting the corresponding 
monomers. Now, the absence of pseudoknots means the absence of intersection of the arcs 
- see the Fig|2fc,d. 




a bed 



Figure 2: (a,b): Schematic picture of allowed (a) cactus-like and prohibited (b) pseudoknot con- 
figurations of the bonds; (c,d): Arc diagrams corresponding respectively to configurations (a) and 
(b) (note the intersection of arcs in (d)). 



We assume for simplicity, that except pseudoknots, all other bond configurations are 
allowed. This means, in particular, that at the moment we do not require any minimal 



loop length, as well as we do not yet take into account the cooperativity effect 39|. These 



assumptions are known to be false for real RNA molecules (for example, there are no loops 



shorter than 3 monomers in RNA chains 34j). However, one can speculate that (see, for 



example, 35|) if the links of the chain are considered as renormalized quasi-monomers 
consisting of several "bare" units, these assumptions seem to be plausible. Nevertheless, 
in the last Section we study in detail the effect of minimal loop length on the structure 
formation. 

Let us remind that one of the main goals in this work consists in developing an algorithm 
for the computation of the cost function, which characterizes the similarity of two RNA- 
type random sequences. To succeed, we should incorporate in the conventional cost function 
discussed above the contribution coming from the entropy of different rearrangements of 
cactus-like conformations typical for RNA's. It is not obvious how to do that directly in 
the frameworks of the dynamic programming approach formalized in the recursion relation 



(J7J). To proceed, we exploit the idea (formulated for the first time in [32]), which consists of 
two consecutive steps: 



1. First of all, we reformulate the recursion relation ([7]) in terms natural for statistical 
mechanical consideration and show that ([7]) can be regarded as a relation for the free 
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energy of some statistical model describing the formation of a complex of two random 
heteropolymer linear chains in a zero-temperature limit; 

2. Secondly, we take into account the possibility for random heteropolymer chains to form 
complex spatial cactus-like structures and write the corresponding recursion relations 
for the partition function (but not for the free energy) at some temperature T not 
obliged to be zero. By taking the limit T — > at the very end we arrive at the desired 
cost function. 



The generic partition function G m ,n of a complex of two heteropolyers, where each of 
chains can form a cactus-like structure, shown in the FigfS^i, can be written in the form 
similar to (P): 



C - nW n( 2 ) + R r n (1) n {2) 



im—i Jn—i 

(18) 



G m fi — g m ; Go, n — Qn ; G^o — 1 



where gn and gff are the partition functions of individual chains. They satisfy the selfcon- 
sistent Dyson-type equation [3j 



36 



n-l n (1) 

rf' = i + E E & (T^zW * w = 1 < 19 > 

i=l j=i+l+£ u ' 

(the same equation should be written for )■ The Boltzmann weights • are the constants 
of self-association, which are, similarly to /3 m>n , variables encoded by some contact map 
and the denominator describes the contribution of the entropic "loop factor". The value 
a = 3/2 (considered throughout our paper) corresponds to the loop factor of ideal chains. 
The summation over j running from % + 1 + I till n ensures the absence of loops of lengths 
smaller than i = 3 monomers. The equation (fl9j) is schematically depicted in the Figj3j 




Y///////A - + 

Figure 3: Diagrammatic form of the Dyson-type equation for the partition function of an individual 
chain g n having cactus-like topology. 



Equations (fT5]l - (fl9]l constitute the analytical basis of our numerical studies and these 
equations are considered as a replacement of the dynamic programming algorithm for match- 
ing of sequences with RNA-type architecture. 
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III. MATCHING ALGORITHM FOR TWO NONCODING RNAS 

In this Section we describe an algorithm for computing the binding free energy (which 
plays a role of the cost function) for the pair of two noncoding RNAs. Let us remind that 
in ncRNA case we align only the sequences of nucleotides which constitute pairs between 
two different RNAs and the cactus-like secondary structure of each RNA contributes to the 
total cost function by corresponding entropic factors. 

Extrapolating the free energy of linear sequences to zero temperature we recover (for linear 
sequences only) the well-known standard dynamic programming algorithm described in 
(]15]) -(ll7 p . For cactus-like structures our algorithm is not reduced (even at zero temperature) 
to any local recursive scheme. 

The readers who are not interested in the details of the mathematical background dis- 
cussed at length of the Section [III can regard the results of the current Section as a self- 
contained prescription for the computation of the desired cost function. 

For clarity we formulate the sequential steps of our algorithm keeping in mind two trial 
sequences of nucleotides of lengths m and n with m = n = 75. These sequences are depicted 
in the FigJU These sequences will be aligned in two ways being considered as linear and 
cactus-like ( "RNA-like" ) . The free energy (i.e. the cost function) of two sequences of total 
lengths m and n is 

F m ,n = Tin G m)n (20) 

AUCGAUGUAGGGUCACCGGGCUUAUGUUACGACGAGAUGUCUUGUUCGAUCAUGCGCUUCCGCGGAGAGUGGAAA - s1 
AGUUGCACCGCCAGACUACUUAACUAAACGUCGGCCAAGACAAUUCGCAUCGACCUAGUUAGCACGCACCAUCGA - s2 

Figure 4: Two trial sequences of m = n = 75 nucleotides. 



A. Matching of linear sequences 

Suppose for the time being that both sequences in FigJU are linear. Construct the matrix 
G whose elements Gij (1 < i < m; 1 < j < n) are the partition functions satisfying the 
relation ( TTCTj) — (TL3~j) with the boundary conditions G m ,o = Co, n = Go,o = 1 (see ()9])). The 
matrix element Gij defines matching of i first nucleotides of the 1st sequence with j first 
nucleotides of the 2nd one. The effective energy of two complimentary nucleotides in the 
( ITOj) is u = 1, while for non-complimentary ones is v = 0. It is easy to see from (jl~T|) that 



the search of G m ^ n can be completed in polynomia 
the standard dynamic programming algorithm 25 



time ~ 0{mn). At T — > we recover 



26|] (see ©). 
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B. Matching of RNA type sequences 

Suppose now that both sequences in FigJUcan form hierarchical cactus-like (i.e. "RNA- 
type") structures. The computation of the free energy of the complex built by the pair of 
RNA-type sequences can be accomplished in two sequential steps: 

• Compute the matrices g^ 1 ' and (of sizes m x m and n x n) of statistical weights 
of 1st and 2nd sequences separately. Rewrite ffT9l) as 



r=i s=i+l+£ ^ ' 



where gf] is the statistical weight of the loop from the nucleotide i till the nucleotide j 



''■j 



in the 1st (a = 1) or 2nd (a = 2) sequence. For each a = 1, 2 the systems of equations 
fl2T]) are quadratic in gf^ and can be solve recursively. The boundary conditions 



together with the recursion scheme f[2~lj) uniquely define the elements glf + i- Knowing 
<7i"+i and applying f l2~T|) again, we compute g^ +2 - The elements g^ with i > j are 
set equal to zero. The free energy (the cost function) of the hierarchical cactus-like 
structure is defined by (1201 ). 

• Knowing the matrices g^ and g^ find the elements Gij of the matrix G by solving 

(Hi. 

The ground state free energy F = F(T = 0) (i.e. the binding free energy at zero's tem- 
perature) for cactus-like structures can be explicitly computed by extending the approach, 
developed in Section III CI The zero-temperature free energies F m>n of branching structures 
read (compare to Eqs. (TT5]) — (TT6]) ) : 



F m , n = max 

1=1, . . . ,m 



fl,m + /l,n> Qi,j (22) 

j=l,...,n 

where f^- = T\ng^ (a = 1,2) are the free energies of individual subsequences from the 
nucleotide i till the nucleotide j, and Qij is the zero-temperature limit of the term in 
Eq.(PSD: 

Qij = i^-ij-i + /ff lim + + ^ (23) 

At T = one can write 



fij = max 



1 , . . . , i 

s=i+l+f,...,3 



fr+l,s-l + /s+l,j + fyy^ (24) 



The values fjij define the matching constants of linear sequences (as in (TT6]) ). while rjf^ are 
the matching constants in each separate sequence. 
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The boundary conditions for the ground state free energy follow from the boundary 
conditions of the partition function (TIB]) : 

' ^0,0 = 0; 

< F it0 = fW; l<i<m (25) 

Thus, to compute the ground state free energy of the complex of two RNA-like sequences, 
we should first reconstruct the matrices p 1 ' and of individual chains by applying Eq. (l24|) 
and then find the matrix F by using Eq. ([2"2~j) . The boundary conditions (j25p together with 
Eq. (l23p allow us to compute the elements of the matrix Q for m = 1 and any n. Knowing 
the corresponding matrix Q we define the elements F\j (1 < j < n) of the free energy matrix 
by using Eq. fl22|) . Then we proceed recursively and determine the matrix Q for m = 2 and 
any n, compute again F 2 j (1 < j < n) etc. 

For the sequence depicted in FigJUwe have found the following values of the ground state 
free energies: 

F\(T = 0) = 48 for linear structure 

F C (T = 0) = 51 for cactus-like structure with a = 3/2 and t > 3 

The obtained values coincide with the total number of complimentary pairs in formed (linear 
or cactus-like) structures. The discussion of the temperature behavior of the free energy, 
F(T), is given in the Appendix lAl 

IV. STRUCTURE RECOVERY 

In this Section we describe the implementation of the structure recovery algorithm for 
linear and cactus-like structures by the corresponding matrices of free energies F at zero 
temperature. Let us point out that due to the degeneration mentioned above, the restored 
sequence is one among the ensemble of sequences with the same free energy. 

A. Finding the Longest Common Subsequence for linear chains 

Sequence matching problem for linear structures consists in finding the longest common 
subsequence (possible with gaps) of two given sequences of nucleotides. Let us demonstrate 
on simple example how the algorithm works. Consider two sequences of m = n = 6 nu- 
cleotides and construct the incidence matrix rj with t]ij = 1 if monomers i of the 1st sequence 
and j of the second one match each other, and r] i; j = otherwise - see Fig|5h. In FigJ5b 
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Figure 5: (a) Incidence matrix r], (b) ground state free energy matrix F. 



we have shown the matrix of ground state free energies, F, computed via the recursion 
algorithm ( TT5|) (TT6]) . 

In order to see which nucleotides form links, let us proceed as follows. Take the element 
Fij of the matrix F and compare its value to the values of three neighboring matrix elements 
Fjj.i. Now we take the following decisions: 

• If Fi_ Xj j_i = max [F i _ij_ 1 , F i _ 1 j, i^-i] then i of the 1st sequence is linked to j of the 
2nd one; 

• If Fi_ij = max [Fi-ij-i, -Fj-ij, -Pij-i] then we skip the element % in the 1st sequence; 

• If F%j-\ = max [Fi-ij-i, i^-ij, then we skip the element j in the 2nd sequence. 

This procedure begins with the element F m n . 

This prescription for computing the matrix of ground state free energies shown in FigJSb 
gives (due to degeneration) many sequences with the same value of the free energy. Two 
possible realizations are depicted in Figj6j 



B. Finding the secondary structure for interacting RNA like chains 

The structure recovery for the chains with cactus-like structures is much more involved 
problem, however it can also be described recursively. In this case the algorithm consists of 
the following successive steps: 

• Begin with the element F mn and use ( 122|) . If F mn > /{'^/fj we consider the matrix 
Q (Eg. ( 123]) ) and chose the maximal element Q VA of the matrix Q which corresponds 
to pairing between the nucleotide p of the 1st sequence and nucleotide q of the second 
one; 



Sequence matching algorithms and pairing of noncoding RNAs 15 





G 


C 


G 


G 


A 


A 


c 


1 


1 


1 


1 


1 


1 


G 


1 


2 


2 


2 


2 


2 


U 


1 


2 


2 


2 


3 


3 


U 


1 


2 


2 


2 


3 


4 


C 


1 


2 


3 


3 


3 


4 


C 


1 


2 


3 


4 


4 


4 



( a ) GCGGAA 
CGUDCC 





G 


C 


G 


G 


A 


A 


c 


1 


1 


1 


1 


1 


1 


G 


1 


2 


2 


2 


2 


2 


U 


1 


2 


2 


2 


3 


3 


U 


1 


2 


2 


2 


3 


4 


C 


1 


2 


3 


3 


3 


4 


C 


1 


2 


3 


4 


4 


4 



( b ) GCGGAA 
CGUUCC 



Figure 6: (Color online) Structure recovery algorithm for linear chains. 



• For F p _i s _i consider the corresponding matrix Q (Eq. (l23j) ). chose the maximal ele- 
ment, <5max of this matrix and compare it with the value F = F — m +fjj+i n^Vp^'i 
Fq = F myn (on the next step we use F instead of F ). Now, 

— If Qmax = F, we look for the next pair (s,r) of linked nucleotides and proceed 
analogously; 

— If Qmax < F, then (according to Eq.( l22p ) there are on any more pairs of linked 
nucleotides in the considered branching structure. 

• Knowing pairs of linked nucleotides, for example, (p,q) and (s,r), we reconstruct the 
structure of the loops between the paired nucleotides by the corresponding statistical 
weights fp]) and f$ . 

The sequence of operations for the structure recovery of RNA-like chains is schematically 
depicted in the figure UJ 

Below we demonstrate on simple example how this algorithm works. Take two sequences 
SI and S2 as shown in FigJHJ The corresponding incidence matrices if (for intra-matching 
Sl-Sl), rf 1 (for intra-matching S2-S2), and n (for inter-matching S1-S2) are shown in FigJH] 
(a), (b) and (c) correspondingly. 

The matrices of effective statistical weights /W and of first and second sequences, 
as well as the ground-state free energy matrix F, are shown in the Figj9] (a), (b) and (c). 
The elements f m +i,j and f n +ij, which formally present in the computations, are set to zero: 
fm+l,j = fn+l,j = for all j. 

By comparing FigJHKb with FigJHb we see that since f$ = f[ 2 ^ = 2 and F = F 7 j = 6, 
we have F-jj > f[j + f[ 2 j. According to the algorithm described, write the matrix Q 
corresponding to the element Fjj. This matrix Q is depicted in FigJTOk. (Recall that each 
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F„„ = max (/« + f$, Qi , ) QiJ = Fi _ UH + + fjZ + ^ /« = ™* + + 1,) 



if F > f (l) + f (2) 

11 1 m,n J\,m T J \,n 



for F m>M write matrix Q and chose g max = Q p q 



consider F plqV write matrix Q 





v 

chose element g max = Q r s 












compare Q rs with F = F mn + (fg^+fgl + \ q ) 



if g ( . s = F, then search for 
next pair and consider F r _ x s _ x 



if Q <F, then the END 



Figure 7: Structure recovery algorithm for RNA molecules. 



element Fij has its own matrix Q of size % X j). We show only those matrices Q which are 
used for the structure recovery. 

The maximal element of the matrix Q depicted in FigJTUk is Q 7 j = 6, meaning that the 
7th nucleotide of SI interacts with the 7th nucleotide of S2. 

To find the next pair of interacting monomers, consider the matrix Q corresponding 
to the element F 66 . This matrix Q is depicted in FigJTOb. It has two maximal elements: 
Q 3j6 = Q 3 i = 5. Thus one has degeneration for the structure under consideration. According 
to our algorithm, the choice of Q 3> i means the interaction of the 3rd nucleotide of the 1st 
sequence with the 1st nucleotide of the 2nd sequence. At this stage the recovery process 
is completed. For the choice Q 36 we compute F^ = F — (/g 1 ^ + + 777,7). Since 
f^j = fg 2 j = and 777,7 = 1, we see that Q 3 6 = F^\ This means that the 3rd monomer 
of Si and the 6th monomer of S2 constitute the next interacting pair. Now we consider 
F 2j5 . The corresponding matrix Q is shown in the FigJTUb. We see that Q max = = 2; 
fS = !; fifi = °; *kfi = 1. Since, as before, F^ = F« - (/$ + /$ + 773,6), we see that 
^2,4 < F^ . Thus, the 2nd and 4th nucleotides do not interact and in the structure there are 
no more interacting nucleotides. The loop structures can be reconstructed by corresponding 
statistical weights - see FigfTTl 
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Figure 8: Incidence matrices for pairs of chains with possible clover-leaf structures inside each 
sequence: (a) intra-matching Sl-Sl; (b) intra-matching S2-S2; (c) inter-matching S1-S2. 
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Figure 9: (Color online) Algorithm description: Energies corresponding to incidence matrices in 
FigEJ Statistical weights of the 1st (a) and 2nd (b) sequences; (c) Ground-state free energy matrix. 



The proposed algorithm is applied to the longer trial sequences shown in FigJH Namely, 
we have performed the structure recovery for three different cases: for linear chains (a) 
(for them we use the algorithm described in the part 1), for cactus-like chains (b) and for 
cactus-like chains with the restriction on the size of the minimal loop length (c) (there are 
no loops less than 4 nucleotides). These structures are depicted in the figures FigJT2k.b and 
c correspondingly. 
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Figure 10: (Color online) Algorithm description: Matrices Q corresponding to: a) Fj 7; b) Fqq; c) 

^2,5- 
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Figure 11: Algorithm description: Structures recovered from the pair of short sequences shown in 
FigEJ 



V. CONCLUSION 



In this paper we have developed and implemented a new statistical algorithm for quan- 
titative determination of the binding free energy of two heteropolymer sequences under 
the supposition that each sequence can form a hierarchical cactus-like secondary structure, 
typical for RNA molecules. For the sequences of lengths m and n the search algorithm is 
completed in time ~ 0(m 2 x n 2 ). 

We have offered in Section HTT1 a constructive way to build a "cost function" characterizing 
the matching of two noncoding RNAs with arbitrary primary sequences. Since base-pairing 
of two ncRNAs or between ncRNA and DNA plays very important biological role, it is 
worth estimating theoretically the binding free energy of the ncRNA-target RNA complex 
by knowing the primary sequences of chains under consideration. Note, that this problem 
differs from the complete alignment of two RNA sequences: in ncRNA case we align only 
the sequences of nucleotides which constitute pairs between two RNAs, while the secondary 
structure of each RNA comes into play only by the combinatorial factors affecting the entropc 
contribution of chains to the total cost function. 

The proposed algorithm is based on two facts: i) the standard alignment problem can be 
reformulated as a zero-temperature limit of more general statistical problem of binding of 
two associating heteropolymer chains; ii) the last problem can be straightforwardly general- 
ized onto the sequences with hierarchical cactus-like structures (i.e. of RNA-type). Taking 
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AUCGAUGUAGGGUCACCGGGCUUAUGUUACGACGAGAUGUCUUGUUCGAUCAUGCGCUUCCGCGGAGAGUGGAAA - s1 
AGUUGCACCGCCAGACUACUUAACUAAACGUCGGCCAAGACAAUUCGCAUCGACCUAGUUAGCACGCACCAUCGA - s2 

(a) 



Tl 



AUCGAUGUAGGGUCACCGGGCUUAUGUUACGACGAGAUGUCUUGUUCGAUCAUGCGCUUCCGCGGAGAGUGGAAA - s1 

///// 



AGUUGCACCGCCAGACUACUUAACUAAACGUCGGCCAAGACAAUUCGCAUCGACCUAGUUAGCACGCACCAUCGA - s2 
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Figure 12: (Color online) Structures recovered from the pair of sequences shown in FigUJ (a) linear 
structure; (b) branching structure; (c) branching structure with the restriction on the size of the 
minimal loop (there are no loops less than 4 nucleotides). 



zero-temperature limit at the very end we arrive at the desired ground state free energy 
with account for entropy of side cactus-like loops. 

In this paper we have also demonstrated in detail (see Section lIVp how our algorithm 
enables to solve the structure recovery problem, which is in some sense, "inverse" with 
respect to finding the best matching of two ncRNAs. In particular, we can predict in zero- 
temperature limit the cactus-like (i.e. the secondary) structure of each ncRNA by knowing 
only their primary sequences. 

In addition we have performed the statistical analysis of a pair of linear and RNA-type 
random sequences. To avoid the congestion of the paper by the details of computations we 
have presented these results in Appendix IB] 
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Appendix A: Temperature dependence of the free energy 



Analyzing the temperature dependence of the free energy, F(T) for linear and cactus-like 
chains and have found some significant differences. The figure [13] demonstrates the F(T)- 
dependencies of the trial sequences shown in Fig JH under the condition that they form linear 
or hierarchical cactus-like structures (with loop factor for ideal chains, a = 3/2, and with 
the restrictions on the minimal length, £, of the loop). 
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Figure 13: Temperature dependence of the free energy of random trial sequence for linear structures 
(O) and RNA-like structures (A - with £ > 3). 

At high temperatures the F(T)-dependencies for linear and cactus-like (with the minimal 
loop's length I = 3) structures are almost identical. This signals that the creation of any 
loop of length I > 3 becomes entropically unfavorable. 

At sufficiently low T the F(T)-dependencies for linear and cactus like structures (with 
£ > 3) deviate from each other. This deviation has rather transparent physical explanation. 
Represent the free energy F(T) at T — > in the following generic form 

F = F + T\nW + Te~ u/T (Al) 

where F is the ground state energy and W is the number of states with the same energy 
(degeneration). According to (lAip the slope of the curve F(T) at T — » determines the 
degree of the degeneration. Decrease of the slope for cactus-like structures indicates that the 
creation of hierarchical "cactuses" and account for entropy of loops removes the degeneration. 
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Appendix B: Statistical analysis of a pair of random sequences 



We have analyzed the basic statistical properties of a pair of random sequences. For 
simplicity, we considered the chains of the same length n. It has been shown in 29J that for 
linear sequences the ground state free energy in the so-called "Bernoulli matching approxi- 
mation" has the following behavior at n ^> 1: 



--n + /(c) ( X ) n 1/3 



a 



1 + 



(Bl) 



sec 



29| for details), c is the number of different nucleotides (in 



where /(c) 

our case c = 4) and x is some random variable with known n-independent distribution 
((X) = -1.7711... and (x 2 ) - (xf = 0.8132...). 
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Figure 14: Plots of the average free energy, (F(n)} (linear scale), and its fluctuations, cr(n) (double 
logarithmic scale) in zero-temperature limit for: a) linear chains and b) cactus-like chains. 

In the FigJT4h we have plotted (F(n)) (in the linear scale) and u{n) (in the double 
logarithmic scale). One sees that the slope k\ ~ 0.65 in Figil^h is in very good agreement 
with the value k\ = lim — — > | computed from the 1st of equations (IBlh . while the slope 

0.38 in the FigJT4~b is close to the exponent \ in the 2nd line of (IBip . The averaging has 
been performed over 200 different randomly chosen structures with uniform distribution of 
c = 4 nucleotides. 
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The similar analysis have been performed for sequences with the possibility of cactus- 
like structure formation. The plots of (F(n)) and cr(n) are shown in the figures [T4b (in 
linear scale and in double logarithmic scale correspondingly). One sees that again, as for 
linear sequences, (F(n)) = k c n for large n, but the coefficient k c ~ 0.92 is larger than k\ 
what signals the large number of pairs in the ground state, leading to the loop creation. 
The slope in the FigJHH allows one to conclude that the loop creation does not affect the 
universality class of the fluctuations and it remains the same as for linear sequences. 



[1] V. Ambros, Cell 107, 862 (2001) 
[2] G. Storz, Science 296, 1260 (2002) 

[3] P. Navarro, S. Pichard, C. Ciaudo, P. Avner, C. Rougeulle, Genes & Development 19 1474 
(2005) 

[4] S.R. Eddy, Cell 109, 137 (2002) 

[5] V. Pande, A. Grosberg, T. Tanaka, Rev. Mod. Phys. 72, 259 (2000) 

[6] E. Rivas, S.R. Eddy, J. Mol. Biol. 285, 205 (1999) 

[7] S.B. Needleman and CD. Wunsch, J. Mol. Biol. 48, 443 (1970) 

[8] T.F. Smith and M.S. Waterman, J. Mol. Biol. 147, 195 (1981); Adv. Appl. math. 2, 482 
(1981) 

[9] M.S. Waterman, L. Gordon, and R. Arratia, Proc. Natl. Acad. Sci. USA, 84, 1239 (1987) 
[10] S.F. Altschul et. al., J. Mol. Biol. 215, 403 (1990) 

[11] D. Sankoff and J. Kruskal, Time Warps, String Edits, and Macromolecules: The theory and 

practice of sequence comparison (Addison Wesley, Reading, Massachussets, 1983) 
[12] A. Apostolico and C. Guerra, Alogorithmica, 2, 315 (1987) 
[13] R. Wagner and M. Fisher, J. Assoc. Comput. Mach. 21, 168 (1974) 

[14] D. Gusfield, Algorithms on Strings, Trees, and Sequences (Cambridge University Press, Cam- 
bridge, 1997) 

[15] J. Boutet de Monvel, European Phys. J. B 7, 293 (1999); Phys. Rev. E 62, 204 (2000) 

[16] V. Chvatal and D. Sankoff, J. Appl. Probab. 12, 306 (1975) 

[17] J. Deken, Discrete Math. 26, 17 (1979) 

[18] J.M. Steele, SIAM J. Appl. Math. 42, 731 (1982) 

[19] V. Dancik and M. Paterson, in STACS94, Lecture Notes in Computer Science, 775, 306 

(Springer: New York, 1994) 
[20] K.S. Alexander, Ann. Appl. Probab. 4, 1074 (1994) 

[21] M. Kiwi, M. Loebl, and J. Matousek, in Lecture Notes in Computer Science, 2976 302 

(Springer: Berlin, 2004) 
[22] M. Zhang and T. Marr, J. Theor. Biol. 174, 119 (1995) 
[23] T. Hwa and M. Lassig, Phys. Rev. Lett. 76, 2591 (1996) 



Sequence matching algorithms and pairing of noncoding RNAs 23 



[24] R. Bundschuh, Eur. Phys. J. B 22, 533 (2001) 

[25] M.S. Waterman, Bull. Math. Biol. 46, 473 (1984) 

[26] M.S. Waterman and M. Vingron, Statistical Science, 9, 387 (1994) 

[27] R. Bundschuh, T. Hwa, Discrete Appl. Math. 104, 113 (2000). 

[28] D. Drasdo, T. Hwa, M. Lassig, J. Comp. Biol. 7, 115 (2000) 

[29] S.N. Majumdar and S. Nechaev, Phys. Rev. E 69, 011103 (2004). 

[30] C.A. Tracy and H. Widom, Comm. Math. Phys. 159, 151 (1994); see also Proc. of ICM, 
Beijing, Vol. I, 587 (2002). 

[31] For a recent review of the appearence of Tracy-Widom distribution in several physics prob- 
lems, see S.N. Majumdar (Les Houches lecture notes on 'Complex Systems', 2007), arXiv: 
|cond-mat/0701193| 

[32] M.V. Tamm, S.K. Nechaev, Phys. Rev. E 78, 011903 (2008) 

[33] L. Comtet, Advanced Combinatorics: The Art of Finite and Infinite Expansions, (Dordrecht: 

Reidel, 1974) 
[34] M. Mueller, Phys. Rev. E, 67, 021914 (2003) 

[35] A.M. Gutin, A.Yu. Grosberg, E.I. Shakhnovich, J.Phys. A: Math. Gen. 26, 1037 (1993) 
[36] P. de Gennes, Biopolymers, 6, 715 (1968) 

[37] I.Ya. Erukhimovich, Vysokomolek. Soed., 20B, 10 (1978) (in Russian) 

[38] This general description covers both cases (DNA and RNA) by a straightforward redefinition 
of letters. 

[39] The cooperativity means that if two links are connected with each other, then the two adjacent 
links have larger probability to be also connected. 



